Abstract

Graph partitioning is a topic of extensive interest, with applications to parallel processing. In 
this context graph nodes typically represent computation, and edges represent communication. 
One seeks to distribute the workload by partitioning the graph so that every processor has 
approximately the same workload, and the communication cost (measured as a function of 
edges exposed by the partition) is minimized. Measures of partition quality vary; in this paper 
we consider a processor’s cost to be the sum of its computation and communication costs, and 
consider the cost of a partition to be the bottleneck, or maximal processor cost induced by the
partition. For a general graph the problem of finding an optimal partitioning is intractable.
In this paper we restrict our attention to the class of k-ary n-cube graphs with uniformly
weighted nodes. Given mild restrictions on the node weight and number of processors, we 
identify partitions yielding the smallest bottleneck. We also demonstrate by example that some 
restrictions are necessary for the partitions we identify to be optimal. In particular, there exist 
cases where partitions that evenly partition nodes need not be optimal. 


*This research was partially supported by the National Aeronautics and Space Administration under NASA 
contract number NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science
and Engineering (ICASE), NASA Langley Research Center, Hampton, VA, 23681.

1 Introduction

The problem of assigning workload in a parallel system has long been viewed as important, and
in the general case, as intractable. A significant amount of research has addressed the problem of
finding good, if not optimal, workload mappings; a number of different objective functions have
been used. All relevant objective functions recognize that the quality of both load balance and
communication costs are important. While workload imbalance is generally defined as a large deviation
between the maximum and average load among processors, treatments of communication costs
differ. A common technique is to measure the communication cost as the sum of all communication
induced by the mapping. While this sometimes leads to more tractable treatments (e.g. [8, 12]), it
does not capture the fact that communication can happen in parallel. An alternative formulation is
to assess the sum of computation and communication for each processor, and measure the quality
of the mapping as the maximum processor load, or bottleneck l.f]. The bottleneck measure does
not take precedence relationships into consideration, and so is most useful in highly data-parallel
computations where processors typically cycle through computation and communication phases.

In this paper we assume that a very regular graph, a k-ary n-cube [6], describes the
computation and communication needs of a data-parallel problem. Each node in the graph represents some
piece of computational work, which we assume takes w time to perform. Each edge (i, j) represents
some implicit communication necessary between nodes i and j; typically such an edge reflects a
data dependency of node i's computation for the present iteration on the result of executing node
j in the previous iteration (and vice-versa). The edges may be viewed as communication that
must occur at the end of an iteration. We desire to partition the graph into p node sets, assigned
one per processor, so as to minimize the bottleneck cost. The problem is not entirely academic.
Several current parallel architectures have communication topologies based on the k-ary n-cube.
The problem of partitioning a communication topology arises, for instance, when one executes a
parallel simulation of traffic on a k-ary n-cube network [7, 1].

The objective of this paper is to show that under mild restrictions on w and p, the optimal
partition is intuitive, one that equi-partitions the graph into node sets that are internally clustered
as tightly as possible. The main requirement turns out to be that p be large enough relative to the
size of the k-ary n-cube. The central point of interest is that restrictions on w and p are needed;
while intuitive, our results are not at all immediate. We also point out that previous analyses of
partitioning regular grids differ from the current work in a subtle but important way. It is not
the objective of the paper to give new partitioning algorithms, but to clarify one's intuition about
partitioning k-ary n-cubes.

There are three bodies of work on graph partitioning that bear discussion. The technique of
recursive spectral dissection (e.g., [2]) divides a graph into two pieces, based on an eigenvalue
analysis of a matrix describing the graph connectivity. The algorithm is applied recursively until
p node sets are defined, p being a power of two. Each partition cut is guaranteed to achieve a certain level of load
balance (not necessarily perfect balance), with a guaranteed upper bound on the number of edges
cut. Spectral dissection may find some of the partitions we identify as optimal (when k is a power
of two), but is not guaranteed to find them¹. Recursive geometric partitioning (e.g. [9]) is similar
in spirit, but different in details. A graph in $R^d$ is projected onto the unit sphere in $R^{d+1}$, and the
projection is stretched to locate the center of mass (approximately) at the sphere's origin. A great
circle cut of the sphere partitions the node set into two pieces. The technique also guarantees a
certain level of load balance and bounds the number of edges cut. Like spectral partitioning, the
method may find the optimal partitions (in the same special case of k being a power of two), but
also may not. On the other hand, recursive binary dissection [3] (and its extension, parametric
recursive binary dissection [4]) will find the partitions we identify as optimal, when k is a power
of 2. In the case of general graphs there is no such guarantee. The heuristic described in [11] is
shown there to find optimal partitions, and obvious extensions to the heuristic described in [10]
will find all optimal partitions identified in this paper, provided the correct number of processors
in each dimension are supplied in the problem description.

¹ Personal communication from Alex Pothen.


2 Problem Formulation 


A k-ary n-cube $N_{k,n}$ is a graph with $k^n$ nodes, with an edge defined between two nodes i and j
if, in the base-k number system, the expressions of i and j differ in exactly one digit, and differ
there (modulo k) by exactly 1. Thus, if $i = b_{n-1} b_{n-2} \cdots b_0$ is the base-k representation, then in
each dimension $j = 1, \ldots, n$, i shares an edge with $i' = b_{n-1} b_{n-2} \cdots ((b_j + 1) \bmod k)\, b_{j-1} \cdots b_0$ and
with $i'' = b_{n-1} b_{n-2} \cdots ((b_j - 1) \bmod k) \cdots b_0$. These edges are said to be in dimension j, and
i' and i'' are said to be dimension j neighbors of i. Special cases include rings ($N_{k,1}$), hypercubes
($N_{2,n}$), and two- and three-dimensional toruses ($N_{k,2}$, $N_{k,3}$). It is useful to imagine $N_{k,n}$ as a collection of
interconnected rings resident in an n-dimensional space.
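
As an illustration of this definition, the following Python sketch enumerates the neighbors of a node and the edge set of a k-ary n-cube. The function names `neighbors` and `cube_edges` are hypothetical, nodes are represented as tuples of base-k digits, and k ≥ 2 is assumed.

```python
from itertools import product


def neighbors(node, k):
    """Return the set of neighbors of `node` in a k-ary n-cube.

    `node` is a tuple of n base-k digits. For k = 2 the +1 and -1
    neighbors in a dimension coincide, so the set holds one per dimension.
    """
    result = set()
    for j in range(len(node)):
        for step in (1, -1):
            digits = list(node)
            digits[j] = (digits[j] + step) % k
            result.add(tuple(digits))
    return result


def cube_edges(k, n):
    """Return every undirected edge of the k-ary n-cube as a frozenset pair."""
    return {
        frozenset((u, v))
        for u in product(range(k), repeat=n)
        for v in neighbors(u, k)
    }


if __name__ == "__main__":
    print(len(cube_edges(4, 2)))  # 4 x 4 torus
    print(len(cube_edges(2, 3)))  # 3-dimensional hypercube
```

Under these assumptions the 4 x 4 torus has 32 edges and the 3-dimensional hypercube has 12, matching the node degrees 2n (for k > 2) and n (for k = 2).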

A partition of $N_{k,n}$ into p subdomains is a collection of nonempty node subsets $\mathcal{P} = \{P_0, \ldots, P_{p-1}\}$.
Abusing usual notation, we'll denote that an edge e has at least one endpoint in $P_i$ by $e \in P_i$, and
define the indicator function $I(e, P_i)$ to be one if exactly one of e's endpoints is in $P_i$, and zero
otherwise. Then we denote the number of external edges in $P_i$ by

$$Ext(P_i) = \sum_{e \in P_i} I(e, P_i),$$

denote the number of internal edges as

$$Int(P_i) = \sum_{e \in P_i} \left(1 - I(e, P_i)\right),$$

and define the cost of $P_i$ as

$$C(P_i) = w|P_i| + Ext(P_i).$$

Here we weight the cost of each node by w to reflect the execution cost, where the communication
cost associated with one edge is unity. The cost of $\mathcal{P}$ is taken as

$$B(\mathcal{P}) = \max_{0 \le i < p} C(P_i).$$

Given p and w, we wish to find the partition $\mathcal{P}$ that minimizes $B(\mathcal{P})$.
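
The cost definitions above can be sketched directly in Python. This is a minimal illustration, not the paper's code: the helper names (`torus_edges`, `ext_int`, `bottleneck`) are hypothetical, nodes are tuples of base-k digits, and the example partition of a 6-node ring is chosen only for demonstration.

```python
from itertools import product


def torus_edges(k, n):
    """Undirected edges of the k-ary n-cube, each as a frozenset of two nodes."""
    edges = set()
    for u in product(range(k), repeat=n):
        for j in range(n):
            v = list(u)
            v[j] = (v[j] + 1) % k
            v = tuple(v)
            if v != u:
                edges.add(frozenset((u, v)))
    return edges


def ext_int(part, edges):
    """Return (Ext, Int): edges with exactly one / exactly two endpoints in part."""
    part = set(part)
    ext = internal = 0
    for e in edges:
        inside = sum(1 for x in e if x in part)
        if inside == 1:
            ext += 1
        elif inside == 2:
            internal += 1
    return ext, internal


def bottleneck(partition, edges, w):
    """Maximum over subdomains of w * |P_i| + Ext(P_i)."""
    return max(w * len(p) + ext_int(p, edges)[0] for p in partition)


if __name__ == "__main__":
    ring = torus_edges(6, 1)
    parts = [{(0,), (1,)}, {(2,), (3,)}, {(4,), (5,)}]
    print(bottleneck(parts, ring, w=3))  # each arc costs 2w + 2
```

Each two-node arc of the ring has two external edges and one internal edge, so its cost is 2w + 2.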

A very similar special case of this problem has been studied in the context of partitioning grids 
arising from the discretization of domains for the solution of partial differential equations, by Reed 
et al. [14]. It is instructive to consider the subtle difference in the problem specification, because 
the conclusions reached differ greatly. 

The partitions considered by Reed et al. all tessellate a two-dimensional domain (without
wraparound edges) with a common shape, e.g., rectangles, squares, or hexagons. The computation
to communication ratios of different shapes are analyzed, but the communication cost is taken as
the sum (over all grid points in the subgraph) of the cost of communicating each boundary point.
This may vary from point to point. For instance, Figure 1 illustrates some hexes; point A has two
edges cut, but since the endpoints of both edges are in the same hex, Reed et al. count the cost as
one, not two. Point B has two edges cut, but both of these are counted. With this measure, the
communication cost of a hex is taken as 10 although 14 edges are cut. Shapes like hexagons are
shown to achieve a better computation/communication ratio than do squares. This is interesting, 
because in this case our results give general conditions under which squares are optimal, a significant 
difference due entirely to a minor change in the model of communication costs. 

Reed et al.’s measure makes sense in its presented context where a specific numerical algorithm 
calls for the exchange of boundary value grid points. In other contexts unique edges from a node
represent unique pieces of information, and the cost function we adopt is appropriate. We are aware
of algorithms in computational fluid dynamics, for instance, where there is a unique “flow” along 
every edge in a mesh. Most of the grid partitioning community counts cut edges. 

While our results identify general conditions under which equi-partitions are optimal for the 
bottleneck measure, it is worthwhile noting that this need not always be the case. An example 
that partitions a 6 x 6 mesh into 3 partition elements is shown in Figure 2. Here the unbalanced 
partition has bottleneck cost 28w + 10, the balanced partition has bottleneck cost 12w + 12. The 
unbalanced partition is better whenever w < 6/19. This example illustrates the tension between 
partitioning to minimize computational imbalance and communication overhead. Our goal is to find
general conditions under which obvious equi-partitions are optimal with respect to the bottleneck 
metric. 

[Figure 2 panels: unbalanced partition, cost 28w+12; balanced partition, cost 12w+14]

Figure 2: Equal sized partitions need not be optimal


3 Preliminaries 


We first establish some preliminary results. These depend on k in a way that is captured by defining
$T_k = 1$ for $k = 2$, and $T_k = 2$ for $k > 2$.

Observation 1 Let A be any set of nodes in $N_{k,n}$. If $|A| = m$ and $Int(A) = v$, then $Ext(A) = T_k mn - 2v$.

Lemma 2 Let A be any set of nodes in $N_{k,n}$, $k \ne 3$, with $|A| = m$. Then $Int(A) \le (m \log m)/2$.
This bound is achieved when $m = 2^j$ for some $j \le n$.

Proof: We induct on m. The base case of m = 1 is trivially satisfied. Suppose then that the claim
is true for any set of size m - 1 or smaller, and choose any node set A with $|A| = m$. Choose any
two nodes x and y in A, consider their indices expressed in base-k notation and find a dimension
j in which their indices differ in that notation. Let a and b be the dimension j index for x and y
respectively. Viewing these indices as lying on a "ring" 0 - 1 - 2 - ... - (k - 1) - 0, cut the ring
into two sequences of length 2 or greater, one of which contains a, and one of which contains b.
Partition A into sets $X_a$ and $X_b$, with $X_a$ comprised of all nodes whose indices in dimension j lie
in the same range as a's, and $X_b = A - X_a$. Let u and m - u be the number of nodes in $X_a$ and
$X_b$ respectively. By the induction hypothesis, $X_a$ has no more than $(u \log u)/2$ internal edges, and
$X_b$ has no more than $((m - u) \log(m - u))/2$ internal edges. If k = 2 or if $k \ge 4$ there can be no
more than $\min\{u, m - u\}$ edges between $X_a$ and $X_b$, because any such edge has to connect nodes
whose indices differ only in dimension j, and which must be adjacent on the ring we partitioned.
Any node in either set can have at most one edge to the other set. It follows that the number of
internal edges of A is no more than

$$B_m(u) = (u \log u)/2 + ((m - u) \log(m - u))/2 + \min\{u, m - u\}.$$

Now the function

$$f_m(q) = (q \log q)/2 + ((m - q) \log(m - q))/2 + q$$

defined over $q \in [0, m/2]$ completely describes the bound as a function of $q = \min\{u, m - u\}$.
Considered as a continuous function of q, analysis of derivatives reveals $f_m(q)$ to be convex over
$[0, m/2]$, and is hence maximized at the endpoint $q = m/2$. Simple algebra shows that $B_m(u) \le
f_m(m/2) = (m \log m)/2$, completing the induction. Finally, observe that the same argument holds
in the case of k = 2 by relaxing the requirement that the dimension j ring be cut into lengths of
2 or greater: there is only one cut possible, and it is still possible for a node in $X_a$ or $X_b$ to have
at most one edge between $X_a$ and $X_b$. Finally, observe that when $m = 2^j$, $j \le n$, the bound is
achieved by any set A that forms a j-dimensional hypercube in $N_{k,n}$. ■
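
The bound of Lemma 2 can be checked exhaustively on a small instance. The sketch below is only an illustration under stated assumptions: it takes logarithms base 2 (as the derivative used later in the paper implies), encodes node subsets as bitmasks, and uses the hypothetical function name `max_internal_edges`. It enumerates every subset of the 4 x 4 torus, so it is practical only for very small $k^n$.

```python
from itertools import product
from math import log2


def max_internal_edges(k, n):
    """best[m] = largest number of internal edges over all m-node subsets."""
    nodes = list(product(range(k), repeat=n))
    index = {u: i for i, u in enumerate(nodes)}
    edge_masks = set()
    for u in nodes:
        for j in range(n):
            v = list(u)
            v[j] = (v[j] + 1) % k
            v = tuple(v)
            if v != u:
                # An edge is stored as a bitmask with its two endpoint bits set.
                edge_masks.add((1 << index[u]) | (1 << index[v]))
    edge_masks = list(edge_masks)
    best = [0] * (len(nodes) + 1)
    for subset in range(1 << len(nodes)):
        m = bin(subset).count("1")
        internal = sum(1 for e in edge_masks if subset & e == e)
        if internal > best[m]:
            best[m] = internal
    return best


if __name__ == "__main__":
    best = max_internal_edges(4, 2)
    for m in range(1, len(best)):
        bound = m * log2(m) / 2
        print(m, best[m], round(bound, 3))
```

For each m the second column should not exceed the third, with equality when m is a power of two.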

Another bound is also useful. We will say that set A is nowhere completed if A contains no
completed rows, i.e., no dimension j for which there are k nodes whose base-k indices all agree
except in dimension j.

Lemma 3 Let A be any set of nodes in $N_{k,n}$, $k > 2$, with $|A| = m$ such that A is nowhere
completed. Then $Int(A) \le n(m - m^{(n-1)/n})$. This bound is achieved whenever k is divisible by q,
and $m = (k/q)^n$.

Proof: By Observation 1, maximizing Int(A) is equivalent to minimizing Ext(A); we seek a set A'
with m nodes minimizing Ext(A'). A' must be connected, otherwise we could always find a node set
with smaller external edge count by translating a connected component linearly through $N_{k,n}$ until
it eliminates one or more external edges by becoming adjacent to another connected component.
Now represent the set as a "Manhattan polyhedron" (every face is parallel to some axis) formed by
a collection of unit cubes in $R^n$, each cube representing one node, and two cubes sharing a face if
there is an edge between the nodes they represent. Figure 3 illustrates this construct. The number
of external edges is thus equal to the number of exposed faces, that is, the surface area of the Manhattan
polyhedron. Now the surface area $S_m$ of any Manhattan polyhedron in $R^n$ is at least as large as
that, say $S_r$, of the smallest "orthogonal polyhedron" (a rectangular solid in $R^n$) that completely
encloses it. Let $v \ge m$ be the volume of this orthogonal polyhedron. The polyhedron with volume
v forming a perfect cube in $R^n$ has surface area $S_v \le S_r$. But the orthogonal polyhedron with
volume m forming a perfect cube in $R^n$ has smaller surface area yet. This minimal surface area
is $2n m^{(n-1)/n} \le Ext(A)$. The claimed bound on Int(A) follows from Observation 1. Furthermore,
whenever k is divisible by q, and $m = (k/q)^n$, we can construct a $(k/q) \times (k/q) \times \cdots \times (k/q)$ cube
with exactly m nodes, in which case the bounds are exact. ■

Our optimality results hold when the number of nodes in each partition set, $m = k^n/p$, is
small enough to ensure that the optimal partition sets are nowhere completed. Since some internal
edges are gained by forming a completed row (due to wrap-around), simple extensions to geometric
arguments like those of Lemma 3 are not sophisticated enough to analyze these tradeoffs. However,
a simple argument shows that for sets of size $m \le k$, the configuration minimizing external edges
need not have any completed rows.

Lemma 4 For all $k \ge 2^n$ and $n \ge 2$ there exists a nowhere completed subset of k nodes in $N_{k,n}$
with minimal external edges.

Proof: When $k \ge 2^n$ (and $n \ge 2$), the single configuration of k nodes that completes a row has
exactly $2k(n-1)$ external edges, whereas the proof of Lemma 3 shows that the set of k nodes which
is as cubelike as possible has no more than $2n k^{(n-1)/n}$ external edges. Now $2n k^{(n-1)/n} \le 2k(n-1)$
if and only if $1/k \le (1 - 1/n)^n$. But $1/k \le 0.25$ for all $k \ge 4$, and $(1 - 1/n)^n$ increases
monotonically in n (converging to $e^{-1}$) and $(1 - 1/2)^2 = 0.25$. ■

[Figure 3 panels: node set in 3d cube, external edges are highlighted; Manhattan polyhedron, exposed faces represent external edges]

Figure 3: Geometric interpretation of a connected node set

Proofs that optimally configured sets of size m > k may be nowhere completed are beyond the 
scope of this note. However, we can put a lower bound on Ext{A) for \A\ > k, and analyze the 
relative error of this bound. 

Lemma 5 Let $k \ge 4$. For all $m \ge 2^n$ and $n \ge 2$, let $E_{m,n}$ be the minimal value of Ext(A) among
all node sets A with $|A| = m$. Then

$$2n m^{(n-1)/n} - \frac{m}{k} \le E_{m,n} \le 2n m^{(n-1)/n}.$$


Proof: The upper bound follows from the observation that among all sets A that are nowhere
completed, $2n m^{(n-1)/n}$ is an upper bound on Ext(A), and thus on $E_{m,n}$. The lower bound follows
by subtracting from this the maximum number of external edges that may be deleted by completing
a row: two per possible row. ■

Now the ratio of the lower bound to the upper bound is $1 - m^{1/n}/(2nk)$, so the relative gap between
the bounds increases in m. Values of m we are most interested in derive from equi-partitions where every dimension is
sliced identically. Let q divide k evenly, and let $m = (k/q)^n$. In this case the ratio is
$1 - 0.5/(nq)$. Consequently the bounds become tighter with increasing dimension size, n, and with
decreasing partition set size $(k/q)^n$.

Let A be any set of nodes with $|A| = m$. From the observations above we see that

$$C_1(m) = wm + T_k mn - m \log m \le C(A) \quad \text{for all } m = 1, 2, \ldots, k^n,$$

and

$$C_2(m) = wm + T_k mn - 2n\left(m - m^{(n-1)/n}\right) \le C(A) \quad \text{for all } m = 1, 2, \ldots, k.$$

Observe that $C_2(m)$ is monotone non-decreasing: for $k > 2$ we have $T_k = 2$, so
$C_2(m) = wm + 2n m^{(n-1)/n}$, which is non-decreasing in m for $w \ge 0$. Another result describes the
relationship between $C_1$ and $C_2$.

Lemma 6 For all $m \in [1, 2^n]$, $C_1(m) \ge C_2(m)$. For all $m \ge 2^n$, $C_1(m) \le C_2(m)$.

Proof: Analysis of derivatives with respect to m shows that $C_1'(1) > C_2'(1)$; since $C_1(1) = C_2(1)$
we infer that initially, for $x > 1$, $C_1(x) > C_2(x)$. Since both functions are continuous this
dominance is maintained until the first m such that $C_1(m) = C_2(m)$. Algebra shows that the unique
solution $m > 1$ is $m = 2^n$. At this point $C_1'(2^n) < C_2'(2^n)$, and the dominance reverses. ■
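
The crossover described by Lemma 6 can be checked numerically. The sketch below is an illustration under stated assumptions: logarithms are base 2, and $C_2$ is taken as $wm + T_k mn - 2n(m - m^{(n-1)/n})$, the form implied by combining Lemma 3 with Observation 1. The function names and the parameter values (k = 8, n = 3, w = 1) are hypothetical.

```python
from math import log2


def t_k(k):
    """T_k = 1 for k = 2, and 2 for k > 2."""
    return 1 if k == 2 else 2


def c1(m, w, k, n):
    """Lower bound on cost derived from Lemma 2."""
    return w * m + t_k(k) * m * n - m * log2(m)


def c2(m, w, k, n):
    """Lower bound on cost derived from Lemma 3 (assumed form, see lead-in)."""
    return w * m + t_k(k) * m * n - 2 * n * (m - m ** ((n - 1) / n))


if __name__ == "__main__":
    w, k, n = 1.0, 8, 3
    for m in (1, 2, 4, 8, 16, 27, 64):
        print(m, round(c1(m, w, k, n), 3), round(c2(m, w, k, n), 3))
```

With these parameters the two bounds agree at m = 1 and at m = 2^n = 8, $C_1$ is the larger in between, and $C_2$ is the larger beyond.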


4 Analysis of Cost Function 

Since both $C_1(m)$ and $C_2(m)$ are lower bounds on $C(A)$ for $|A| = m$, the function $C_3(m) = \max\{C_1(m), C_2(m)\}$
is a better composite bounding function. Previous observations have established that

$$C_3(m) = \begin{cases} C_1(m) & \text{for } m \le 2^n \\ C_2(m) & \text{for } m > 2^n \end{cases}$$

Furthermore, it is not difficult to show that $C_3(m)$ is concave over $m \in [1, 2^n]$, and that $C_3(m)$ is
increasing over $m \in [2^n, k]$. We also know that when $k \ge 4$, $C_3(m)$ is a lower bound
on the cost of node set A with $|A| = m \le k$ elements.

Our strategy now is to identify values of $m \le k$ for which it is possible to partition $N_{k,n}$ into
$k^n/m$ isomorphic subgraphs, such that $C(P_i) = C_3(m)$. Since $C_3(m)$ is known to be increasing for
$m \ge 2^n$, we determine conditions under which $C_3(m)$ is increasing over $[1, 2^n]$. Considered as a
continuous function, the first derivative of $C_3(m)$ for $m \in [1, 2^n]$ is

$$\frac{d}{dm} C_1(m) = w + T_k n - \log m - 1/\ln 2.$$

This function decreases in m, and so will be non-negative over $[1, 2^n]$ if it is non-negative at $m = 2^n$.
The latter condition is satisfied whenever $w + n(T_k - 1) \ge 1/\ln 2$. Thus

Lemma 7 If $w \ge 1/\ln 2$, or if $k > 2$ and $n > 1$, then $C_3(m)$ is everywhere monotone non-decreasing
over $[1, k^n]$.
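
The condition in Lemma 7 can be illustrated with a short numerical check of the derivative at $m = 2^n$. This is a sketch under the assumption that logarithms are base 2 and that $w \ge 0$; the function names are hypothetical.

```python
from math import log, log2


def t_k(k):
    """T_k = 1 for k = 2, and 2 for k > 2."""
    return 1 if k == 2 else 2


def dc1(m, w, k, n):
    """Derivative of C_1(m) = w*m + T_k*m*n - m*log2(m) with respect to m."""
    return w + t_k(k) * n - log2(m) - 1 / log(2)


def lemma7_condition(w, k, n):
    """Sufficient condition stated in Lemma 7."""
    return w >= 1 / log(2) or (k > 2 and n > 1)


if __name__ == "__main__":
    for w, k, n in [(0.0, 4, 2), (1.5, 2, 3), (1.0, 2, 3)]:
        print(w, k, n, lemma7_condition(w, k, n), round(dc1(2 ** n, w, k, n), 4))
```

In the last case (hypercube, w = 1, n = 3) the condition fails and the derivative at $m = 2^n$ is negative, which is the situation in which the lower bound is not monotone.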

Monotonicity of $C_3(m)$ can be exploited, for if node sets $P_0, \ldots, P_{p-1}$ have sizes $m_0, \ldots, m_{p-1}$,
then $\max\{C_3(m_0), \ldots, C_3(m_{p-1})\}$ is minimized when the node sets have equal sizes. To complete
the analysis we simply identify conditions on p that ensure that $C_3(m) = C(P_i)$ for all $i = 0, \ldots, p - 1$,
and that $N_{k,n}$ can be partitioned into isomorphic node sets with this cost. Such partitions must
be optimal.

Theorem 8 The following are optimal partitions of $N_{k,n}$ with respect to the bottleneck cost.

• If some condition of Lemma 7 is satisfied, k is even, and $p = k^n/2^j$ with $j \le n$, then $N_{k,n}$
may be partitioned into isomorphic hypercubes of dimension j.

• If some condition of Lemma 7 is satisfied, there is integer q such that $(k/q)^{1/n}$ is integer, and
$p = q k^{n-1}$, then $N_{k,n}$ may be partitioned into isomorphic blocks of shape
$(k/q)^{1/n} \times (k/q)^{1/n} \times \cdots \times (k/q)^{1/n}$.
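
To make the block partitions concrete, the following Python sketch tiles a k-ary n-cube with equal cubic blocks and evaluates the bottleneck cost. It is an illustration only: the function names are hypothetical, `side` is assumed to divide k, and the two examples (2 x 2 blocks of the 4 x 4 torus, 3 x 3 blocks of the 9 x 9 torus) were picked because their costs can be checked by hand.

```python
from itertools import product


def block_partition(k, n, side):
    """Split the k-ary n-cube into side x ... x side blocks (side must divide k)."""
    blocks = {}
    for node in product(range(k), repeat=n):
        key = tuple(d // side for d in node)
        blocks.setdefault(key, set()).add(node)
    return list(blocks.values())


def external_edges(part, k):
    """Count edges with exactly one endpoint in part (counted from the inside)."""
    part = set(part)
    count = 0
    for node in part:
        nbrs = set()
        for j in range(len(node)):
            for step in (1, -1):
                v = list(node)
                v[j] = (v[j] + step) % k
                nbrs.add(tuple(v))
        count += sum(1 for v in nbrs if v not in part)
    return count


def bottleneck(partition, k, w):
    """Maximum over blocks of w * |P_i| + Ext(P_i)."""
    return max(w * len(p) + external_edges(p, k) for p in partition)


if __name__ == "__main__":
    print(bottleneck(block_partition(4, 2, 2), 4, w=1))  # 2 x 2 blocks: 4w + 8
    print(bottleneck(block_partition(9, 2, 3), 9, w=2))  # 3 x 3 blocks: 9w + 12
```

A 2 x 2 block has 8 external edges and a 3 x 3 block has 12, so the costs 4w + 8 and 9w + 12 coincide with the lower bounds $wm + T_k mn - m \log m$ at m = 4 and $wm + 2n m^{(n-1)/n}$ at m = 9 respectively.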

The partitions identified by this theorem are quite intuitive. They divide $N_{k,n}$ uniformly into
equally sized sets of nodes, and the nodes in a set are clustered tightly. If the number of nodes in the
set is less than $2^n$, the nodes form a hypercube of some dimension no greater than n. If the number
of nodes exceeds $2^n$ (but is no greater than k), they form a perfect cube in an n-dimensional space.
However, while these optimal partitions are intuitive, we have already seen that perfectly balanced 
partitions need not be optimal. It is also noteworthy that the requirement on w for optimality 
disappears when p is small enough (p < Ibf"-’)/”), or when k > 2. 

A final result addresses the fact that restricting the number of nodes per processor to k or fewer
may be overly conservative. For $k < m \le (k/2)^n$ we can bound the deviation from optimal of cubic
equi-partitions.

Lemma 9 Let q divide k evenly, and consider the partitioning of $N_{k,n}$ into adjacent blocks of size
$(k/q) \times \cdots \times (k/q)$. Then the bottleneck cost is no more than $100/(nq)\%$ larger than optimal.

Proof: Using $m = (k/q)^n$, Lemma 5 shows that the increase in external communication cost of the
cubic partition is no more than $100/(nq)\%$. ■


5 Conclusions 

k-ary n-cubes are regular graph structures that are found in numerous contexts, especially in
descriptions of communication networks. Partitioning of such graphs is a problem that arises in 
network design, and in parallelized simulation of such networks. This paper examines the problem 
of identifying optimal partitions of Nk,n with respect to the bottleneck metric. Our investigations 
identify two points of interest. First, existing work on partitioning regular graphs for parallel 
processing has used a subtly different measure of communication, which leads to very different 
results than ours. Secondly, while the partitions we identify as optimal are intuitive, we show 
by example that equi-partitions need not always be optimal. Our results then help to delineate 
problems with intuitive optimal partitions from those with non-intuitive optimal partitions.

Remaining open problems that we are pursuing include dealing more conclusively with the effect
of completing rows, and determining the minimal value of w ensuring that equi-partitions are
optimal.